Table of Contents¶

  1. What Is a Random Variable?
  2. What Is a Parametric Distribution?
  3. Bernoulli Distribution
  4. Binomial Distribution
  5. Poisson Distribution
  6. Gaussian (Normal) Distribution
  7. Student’s t‑Distribution
  8. Central Limit Theorem in Hypothesis Testing
     8.1. Sampling Distribution and the CLT
     8.2. Central Limit Theorem in Hypothesis Testing
     8.3. One‑Sample z‑Test for a Mean
     8.4. One‑Sample z‑Test for a Proportion
In [1]:
# === Cell 1: Imports ===
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

Preliminary Knowledge: Statistics (Part II) - Interactive Notebook¶

This notebook gives a simplified overview of core statistical concepts, including common distributions, the central limit theorem, and hypothesis tests, with interactive Python demonstrations.

1. What Is a Random Variable?¶

A random variable $X$ is a measurable function that assigns each outcome $\omega$ in a probability space $(\Omega,\mathcal{F},P)$ a real value $X(\omega)$, providing a formal mathematical description of randomness.

Random variables are classified by their range as discrete (taking values in a countable set) or continuous (taking values in an uncountable set, such as an interval of $\mathbb{R}$).

The distribution of $X$ is the induced probability measure $P_X$, described by a probability mass function (PMF) $p_X(k)=P(X=k)$ for discrete $X$ or a probability density function (PDF) $f_X(x)$ for continuous $X$ such that $P(a\leq X\leq b)=\int_a^b f_X(x)\,dx$.
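
As a rough numerical illustration of the PMF and PDF definitions above, the sketch below checks that a PMF sums to 1 over its support and that $P(a\leq X\leq b)$ from integrating a PDF agrees with the CDF difference. The Binomial(4, 0.5) and standard normal choices, and the interval endpoints, are arbitrary assumptions made for the example.

```python
import numpy as np
from scipy import stats, integrate

# Discrete case: PMF values over the full support should sum to 1.
# Binomial(4, 0.5) is an arbitrary illustrative choice.
support = np.arange(0, 5)
pmf_total = stats.binom.pmf(support, 4, 0.5).sum()

# Continuous case: P(a <= X <= b) is the integral of the PDF over [a, b].
# The standard normal and the endpoints are illustrative assumptions.
a, b = -1.0, 2.0
prob_integral, _ = integrate.quad(stats.norm.pdf, a, b)
prob_cdf = stats.norm.cdf(b) - stats.norm.cdf(a)

print(f"Sum of PMF: {pmf_total:.6f}")
print(f"P({a} <= X <= {b}): integral={prob_integral:.6f}, CDF difference={prob_cdf:.6f}")
```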

2. What Is a Parametric Distribution?¶

A parametric distribution belongs to a family $\{P_\theta:\theta\in\Theta\}$ of probability distributions, each member of which is completely specified by a finite-dimensional parameter vector $\theta$.

The parameter vector $\theta$ (e.g., mean $\mu$, variance $\sigma^2$, rate $\lambda$, success probability $p$) determines the shape and scale of the distribution, while the parameter space $\Theta$ is the set of all allowable values for $\theta$ (e.g., $\mu\in\mathbb{R}$, $\sigma>0$). Because $\Theta$ is finite-dimensional, parametric distributions enable efficient inference and hypothesis testing by estimating a small number of parameters.

  • Bernoulli($p$): $X\in\{0,1\}$, $P(X=1)=p$ and $P(X=0)=1-p$.
  • Binomial($n,p$): Number of successes in $n$ i.i.d. Bernoulli($p$) trials.
  • Poisson($\lambda$): Count of events in a fixed interval at rate $\lambda$.
  • Normal($\mu,\sigma^2$): Continuous, bell‑shaped PDF with mean $\mu$ and variance $\sigma^2$.
  • Student’s t($\nu$): Heavy‑tailed continuous family with $\nu$ degrees of freedom.

3. Bernoulli Distribution¶

  • Definition:
    A Bernoulli random variable $X$ takes value 1 (“success”) with probability $p$ and 0 (“failure”) with probability $1-p$.
  • Parameters:
    $p \in [0,1]$.
  • PMF:
    $P(X=x) = p^x (1-p)^{1-x},\quad x \in \{0,1\}$
  • Moments:
    $\mathbb{E}[X]=p,\quad \mathrm{Var}(X)=p(1-p)$

Examples of random processes following a Bernoulli distribution:

  • Coin Toss

    • Tossing a (possibly biased) coin once.
    • Outcome: Heads = 1, Tails = 0.
  • Email Spam Detection

    • Classifying a single email as spam or not.
    • Outcome: Spam = 1, Not spam = 0.
  • Light Bulb Testing

    • Checking if a newly made light bulb works.
    • Outcome: Working = 1, Not working = 0.
  • Web Ad Click

    • Determining if a user clicks on an online ad.
    • Outcome: Clicked = 1, Didn't click = 0.
  • Customer Purchase Decision

    • Will a customer make a purchase after seeing a product?
    • Outcome: Purchase = 1, No purchase = 0.
In [2]:
p = 0.3
n_samples = 1000
data_bern = np.random.binomial(n=1, p=p, size=n_samples)
emp_mean = data_bern.mean()
emp_var  = data_bern.var(ddof=0)
print(f"Bernoulli(p={p}): Empirical mean={emp_mean:.3f}, var={emp_var:.3f}")
# PMF plot
x = [0,1]
pmf = [ (1-p), p ]
plt.figure()
plt.bar(x, pmf, alpha=0.6, label='Theoretical PMF')
plt.hist(data_bern, bins=[-0.5,0.5,1.5], density=True,
         alpha=0.4, label='Empirical')
plt.xticks(x)
plt.title("Bernoulli Distribution PMF vs. Empirical")
plt.legend()
plt.show()
Bernoulli(p=0.3): Empirical mean=0.298, var=0.209

4. Binomial Distribution¶

  • Definition:
    The number of successes in $n$ independent Bernoulli trials with success probability $p$.
  • Parameters:
    $n \in \mathbb{N}$, $p \in [0,1]$.
  • PMF:
    $P(X=k) = \binom{n}{k} p^k(1-p)^{n-k},\quad k=0,1,\dots,n.$
  • Moments:
    $\mathbb{E}[X]=np,\quad \mathrm{Var}(X)=np(1-p).$
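
To make the definition concrete, the following sketch builds binomial draws directly as sums of Bernoulli outcomes and compares the empirical mean and variance with $np$ and $np(1-p)$. The seeded generator and the variable names (`n_trials`, `p_success`, `n_rep`) are assumptions introduced for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (an assumption)
n_trials, p_success, n_rep = 10, 0.4, 100_000

# Each row holds n_trials Bernoulli(p) outcomes; each row sum is one Binomial(n, p) draw.
bernoulli_draws = rng.random((n_rep, n_trials)) < p_success
successes = bernoulli_draws.sum(axis=1)

print(f"mean: {successes.mean():.3f} (theory {n_trials * p_success:.3f})")
print(f"var:  {successes.var():.3f} (theory {n_trials * p_success * (1 - p_success):.3f})")
```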

Examples of random processes following a binomial distribution:

  • Multiple Coin Tosses

    • Tossing a (possibly biased) coin n times.
    • Outcome: Number of heads (successes) in n tosses.
  • Batch of Emails Classified

    • Checking n emails for spam.
    • Outcome: Number of spam emails detected.
  • Light Bulb Quality Check

    • Testing n light bulbs in a batch.
    • Outcome: Number of working bulbs (successes) out of n.
  • Online Ad Campaign

    • Showing an ad to n users.
    • Outcome: Number of users who click the ad.
  • Customer Purchase Behavior

    • Observing n customers visiting a store.
    • Outcome: Number of customers who make a purchase.
In [3]:
n, p = 10, 0.4
data_binom = np.random.binomial(n=n, p=p, size=n_samples)
print(f"Binomial(n={n}, p={p}): Emp. mean={data_binom.mean():.3f}, var={data_binom.var(ddof=0):.3f}")
k = np.arange(0, n+1)
pmf_binom = stats.binom.pmf(k, n, p)
plt.figure()
plt.bar(k, pmf_binom, alpha=0.6, label='Theoretical PMF')
plt.hist(data_binom, bins=np.arange(-0.5,n+1.5), density=True,
         alpha=0.4, label='Empirical')
plt.title("Binomial Distribution PMF vs. Empirical")
plt.legend()
plt.show()
Binomial(n=10, p=0.4): Emp. mean=4.027, var=2.392

5. Poisson Distribution¶

  • Definition:
    Models the count of events in a fixed interval when events occur independently at constant rate $\lambda$.
  • Parameter:
    $\lambda > 0$.
  • PMF:
    $P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!},\quad k=0,1,2,\dots.$
  • Moments:
    $\mathbb{E}[X]=\mathrm{Var}(X)=\lambda.$
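
As a quick sanity check of the PMF formula, this sketch evaluates $\lambda^k e^{-\lambda}/k!$ by hand and compares it with `scipy.stats.poisson.pmf`. The rate value and the range of counts are illustrative assumptions.

```python
import math
import numpy as np
from scipy import stats

rate = 3.0                 # illustrative rate (lambda)
counts = np.arange(0, 11)  # k = 0, ..., 10

# Direct evaluation of lambda^k * exp(-lambda) / k!
pmf_manual = np.array(
    [rate ** int(k) * math.exp(-rate) / math.factorial(int(k)) for k in counts]
)
pmf_scipy = stats.poisson.pmf(counts, rate)

print("Formula matches scipy:", np.allclose(pmf_manual, pmf_scipy))
```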

Examples of random processes following a Poisson distribution:

  • Number of Emails Received per Hour

    • Counting how many emails arrive in an hour.
    • Events occur independently at a constant average rate.
  • Customer Arrivals at a Store

    • Measuring how many customers enter a store in a given time period.
    • Useful when arrivals are random and independent.
  • Calls Received at a Call Center

    • Number of incoming calls per minute/hour.
    • Assumes calls occur randomly and independently.
  • Typing Errors in a Document

    • Counting the number of typos per page in a long text.
    • Events (errors) are rare and occur independently.
  • Traffic Accidents at an Intersection

    • Number of accidents at a specific intersection per month.
    • Models rare events over a fixed interval of time or space.
In [4]:
lam = 3.0
data_pois = np.random.poisson(lam=lam, size=n_samples)
print(f"Poisson(λ={lam}): Emp. mean={data_pois.mean():.3f}, var={data_pois.var(ddof=0):.3f}")
k = np.arange(0, np.max(data_pois)+1)
pmf_pois = stats.poisson.pmf(k, lam)
plt.figure()
plt.bar(k, pmf_pois, alpha=0.6, label='Theoretical PMF')
plt.hist(data_pois, bins=np.arange(-0.5,np.max(data_pois)+1.5),
         density=True, alpha=0.4, label='Empirical')
plt.title("Poisson Distribution PMF vs. Empirical")
plt.legend()
plt.show()
Poisson(λ=3.0): Emp. mean=3.048, var=3.148

6. Gaussian (Normal) Distribution¶

  • Definition:
    A continuous distribution for real $x$, with bell‑shaped PDF: $ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $
    where $\mu \in \mathbb{R}$ and $\sigma > 0$.
  • Parameters:
    Mean $\mu$, standard deviation $\sigma$.
  • Moments:
    $\mathbb{E}[X]=\mu,\quad \mathrm{Var}(X)=\sigma^2.$

When $\mu=0$ and $\sigma^2=1$, it is called the standard normal distribution.

Standardization Formula¶

Given a normally distributed random variable $ X $ with mean $ \mu $ and standard deviation $ \sigma $, the standardized variable $ Z $ is calculated as:

$ Z = \frac{X - \mu}{\sigma} $

This transformation shifts the distribution so that it has a mean of 0 and scales it so that it has a standard deviation of 1. The resulting variable $ Z $ follows a standard normal distribution, denoted as $ \mathcal{N}(0, 1) $.
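
The standardization step can be checked numerically. The sketch below draws from a normal distribution with hypothetical parameters (`mu_x = 170`, `sigma_x = 8` are made-up values for illustration), applies $Z=(X-\mu)/\sigma$, and confirms that the result has mean close to 0 and standard deviation close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (an assumption)
mu_x, sigma_x = 170.0, 8.0      # hypothetical mean and standard deviation

samples = rng.normal(loc=mu_x, scale=sigma_x, size=100_000)
z_scores = (samples - mu_x) / sigma_x  # Z = (X - mu) / sigma

print(f"Before: mean={samples.mean():.2f}, sd={samples.std():.2f}")
print(f"After:  mean={z_scores.mean():.3f}, sd={z_scores.std():.3f}")
```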

Examples of random processes following a Gaussian (normal) distribution:

  • Heights of People

    • The distribution of adult human heights within a population.
    • Most people are around the average height, with fewer at the extremes.
  • Exam Scores

    • Scores on standardized tests like the SAT or IQ tests.
    • Usually centered around a mean, with symmetric spread.
  • Measurement Errors

    • Errors in scientific measurements due to random noise.
    • Small errors are more common; large errors are rare.
  • Blood Pressure Levels

    • Systolic blood pressure in a healthy population.
    • Follows a bell-shaped curve around the average.
  • Daily Temperature (Under Certain Conditions)

    • Daily temperatures in a specific location during a stable season.
    • Typically distributed around a seasonal average.
In [5]:
mu, sigma = 0, 1
data_norm = np.random.normal(loc=mu, scale=sigma, size=n_samples)
print(f"Normal(μ={mu},σ={sigma}): Emp. mean={data_norm.mean():.3f}, var={data_norm.var(ddof=0):.3f}")
x = np.linspace(-4, 4, 200)
pdf_norm = stats.norm.pdf(x, mu, sigma)
plt.figure()
plt.plot(x, pdf_norm, label='Theoretical PDF')
plt.hist(data_norm, bins=30, density=True, alpha=0.4,
         label='Empirical')
plt.title("Normal Distribution PDF vs. Empirical")
plt.legend()
plt.show()
Normal(μ=0,σ=1): Emp. mean=-0.020, var=1.029

7. Student’s t‑Distribution¶

  • Definition:
    A continuous, symmetric distribution with heavier tails than the normal distribution, parameterized by degrees of freedom ($\nu$).
  • Parameter:
    $\nu > 0$.
  • PDF:
    $ f(x) = \frac{\Gamma\bigl(\tfrac{\nu+1}{2}\bigr)}{\sqrt{\nu\pi}\,\Gamma\bigl(\tfrac{\nu}{2}\bigr)} \Bigl(1+\frac{x^2}{\nu}\Bigr)^{-\frac{\nu+1}{2}}. $
  • Moments:
    $\mathbb{E}[X]=0$ for $\nu>1$, $\mathrm{Var}(X)=\tfrac{\nu}{\nu-2}$ for $\nu>2$.
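
One way to see what "heavier tails" means is to compare two-sided tail probabilities beyond a fixed threshold. The sketch below does this for a few degrees of freedom and also checks the variance formula $\nu/(\nu-2)$ at $\nu=5$. The threshold of 3 and the chosen values of $\nu$ are arbitrary assumptions.

```python
from scipy import stats

threshold = 3.0  # illustrative cut-off

# Two-sided tail probability P(|X| > threshold) under N(0, 1) and under t(nu)
tail_normal = 2 * stats.norm.sf(threshold)
tail_t = {df: 2 * stats.t.sf(threshold, df) for df in (2, 5, 30)}

print(f"Normal: P(|X| > {threshold}) = {tail_normal:.4f}")
for df, tail in tail_t.items():
    print(f"t(nu={df}): P(|X| > {threshold}) = {tail:.4f}")

# Variance of t(5) should equal nu / (nu - 2) = 5 / 3
var_t5 = stats.t.var(5)
print(f"Var of t(5): {var_t5:.4f} (formula: {5 / 3:.4f})")
```

The tail probability shrinks toward the normal value as $\nu$ grows, consistent with the t-distribution approaching the normal distribution for large degrees of freedom.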

Examples of situations where Student's t-distribution is used:

  • Estimating the Mean from a Small Sample

    • Calculating the confidence interval for the mean of a small dataset.
    • Especially when population standard deviation is unknown.
  • Comparing Two Small Sample Means

    • Performing a t-test to check if two small groups have significantly different means.
    • Common in medical trials or psychology experiments.
  • Quality Control with Limited Data

    • Evaluating product consistency when only a few samples are available.
    • Helps model uncertainty in the mean.
  • Analyzing Small Clinical Trials

    • Comparing treatment effects in experiments with few participants.
    • Used in early-phase drug studies.
  • Economics or Business A/B Testing

    • Testing marketing strategies or product changes with small sample sizes.
    • t-distribution accounts for sample variability.
In [6]:
# === Cell 6: Student's t‑Distribution Demo ===
nu = 5
data_t = stats.t.rvs(df=nu, size=n_samples)
print(f"Student’s t (ν={nu}): Emp. mean={data_t.mean():.3f}, var={data_t.var(ddof=0):.3f}")
x = np.linspace(-5, 5, 200)
pdf_t = stats.t.pdf(x, nu)
plt.figure()
plt.plot(x, pdf_t, label='Theoretical PDF')
plt.hist(data_t, bins=30, density=True, alpha=0.4,
         label='Empirical')
plt.title("Student’s t PDF vs. Empirical")
plt.legend()
plt.show()
Student’s t (ν=5): Emp. mean=-0.020, var=1.557

8. Central Limit Theorem in Hypothesis Testing¶

8.1. Sampling Distribution and the CLT¶

  • Statement:
    If $X_1, \dots, X_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, then for large $n$, $ \bar X = \frac1n \sum_{i=1}^n X_i \;\approx\; N\!\Bigl(\mu,\,\frac{\sigma^2}{n}\Bigr). $
  • Consequence:
    Regardless of the original distribution’s shape, the distribution of $\bar X$ approaches normality as $n\to\infty$.
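
Before plotting, a small numeric check can show the $\sigma^2/n$ scaling in the statement above. This sketch simulates sample means from Poisson(1), which has $\mu=\sigma^2=1$, and compares their empirical standard deviation with $\sigma/\sqrt{n}$. The seeded generator, the number of simulations, and the names `lam_true` and `sd_by_n` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (an assumption)
lam_true = 1.0                  # Poisson(1): mean 1, variance 1
n_sim = 20_000                  # number of simulated samples per sample size

sd_by_n = {}
for n_obs in (5, 30, 100):
    # Each row is one sample of size n_obs; row means form the sampling distribution.
    means = rng.poisson(lam_true, size=(n_sim, n_obs)).mean(axis=1)
    sd_by_n[n_obs] = means.std()
    print(f"n={n_obs:>3}: empirical SD={sd_by_n[n_obs]:.4f}, "
          f"sigma/sqrt(n)={1 / np.sqrt(n_obs):.4f}")
```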
In [7]:
# One-sample, two-sided z-test for a mean with known population sigma
def one_sample_z_test(data, mu0, sigma):
    n = len(data)
    xbar = np.mean(data)
    z = (xbar - mu0) / (sigma / np.sqrt(n))   # standardized distance from mu0
    p_val = 2 * (1 - stats.norm.cdf(abs(z)))  # two-sided p-value
    return z, p_val

# Simulate the sampling distribution of the mean: n_sim sample means of size n
def simulate_sampling_dist(dist_func, params, n, n_sim=10000):
    return np.array([np.mean(dist_func(*params, size=n)) for _ in range(n_sim)])

# Plot histograms of simulated sample means with a fitted normal curve
# (laid out on a 2x2 grid, so up to four sample sizes; `scale` is currently unused)
def plot_sampling_distributions(sample_sizes, dist_func, params, scale):
    plt.figure(figsize=(12, 8))

    for i, n in enumerate(sample_sizes, 1):
        means = simulate_sampling_dist(dist_func, params, n)
        mu, sd = np.mean(means), np.std(means)
        plt.subplot(2, 2, i)
        plt.hist(means, bins=30, density=True, alpha=0.7)
        x = np.linspace(mu - 3*sd, mu + 3*sd, 200)
        plt.plot(x, stats.norm.pdf(x, mu, sd))
        plt.title(f'n={n} → Mean≈{mu:.2f}, SD≈{sd:.2f}')
        plt.xlabel('Sample Mean')
        plt.ylabel('Density')
        plt.xlim([0, 2])  # fixed range chosen for the Poisson(1) demo below
    plt.tight_layout()
    plt.show()

# Demo run: sample means of Poisson(λ=1) for increasing sample sizes
sample_sizes = [5, 30, 100, 1000]
plot_sampling_distributions(sample_sizes, np.random.poisson, (1.0,), scale=None)

8.2. Central Limit Theorem in Hypothesis Testing¶

The CLT states that, for large $n$, $ \bar X = \frac1n \sum_{i=1}^n X_i \;\approx\; N\!\Bigl(\mu,\,\frac{\sigma^2}{n}\Bigr). $

How can we judge whether an observed $\bar X$ was plausibly sampled from $N\!\Bigl(\mu,\,\frac{\sigma^2}{n}\Bigr)$?

Assume that $\mu=0$ and $\frac{\sigma^2}{n}=1$, and consider the following observed values of $\bar X$:

  • -3
  • -0.5
  • 0.5
  • 3

Can we say that these values of $\bar X$ were sampled from $N\!\Bigl(\mu=0,\,\frac{\sigma^2}{n} = 1\Bigr)$?

In [8]:
# Plotting the standard normal distribution
x = np.linspace(-4, 4, 200)
pdf_standard_norm = stats.norm.pdf(x, loc=0, scale=1)

plt.figure()
plt.plot(x, pdf_standard_norm, label='Standard Normal Distribution')

# Adding vertical lines at -3, -0.5, 0.5, and 3
for vline in [-3, -0.5, 0.5, 3]:
    plt.axvline(x=vline, color='red', linestyle='--', label=f'x_bar={vline}, pdf={stats.norm.pdf(vline):.3f}')

plt.title("Standard Normal Distribution")
plt.xlabel("x")
plt.ylabel("Density")
plt.legend()
plt.show()

As we can see above, some values of $\bar X$ (-3 and 3) are very unlikely because they fall in low-density regions. As a practical rule of thumb, if $\bar X$ falls outside the central 95% interval of the distribution (equivalently, inside the critical region at $\alpha=5\%$), both shown below, we conclude that $\bar X$ was not sampled from $N\!\Bigl(\mu=0,\,\frac{\sigma^2}{n} = 1\Bigr)$; otherwise, we do not reject that it was.

In [9]:
x = np.linspace(-4, 4, 200)
pdf_standard_norm = stats.norm.pdf(x, loc=0, scale=1)
ci_lower, ci_upper = -1.96, 1.96
plt.figure()
plt.plot(x, pdf_standard_norm, label='Standard Normal Distribution')

# Adding vertical lines at -3, -0.5, 0.5, and 3
for vline in [-3, -0.5, 0.5, 3]:
    plt.axvline(x=vline, color='red', linestyle='--', label=f'x={vline}')

# Adding 95% confidence interval lines
for ci in [-1.96, 1.96]:
    plt.axvline(x=ci, color='blue', linestyle='-.', label=f'95% CI: x={ci}')

# Shade the area outside the confidence interval
x_outside_lower = np.linspace(-4, ci_lower, 100)
x_outside_upper = np.linspace(ci_upper, 4, 100)
plt.fill_between(x_outside_lower, stats.norm.pdf(x_outside_lower), color='red', alpha=0.5, label='Outside CI')
plt.fill_between(x_outside_upper, stats.norm.pdf(x_outside_upper), color='red', alpha=0.5)


plt.title("Standard Normal Distribution with Vertical Lines and 95% CI")
plt.xlabel("x")
plt.ylabel("Density")
plt.legend()
plt.show()

$\bar X = -3$ and $\bar X = 3$ lie outside the 95% interval, while $\bar X = -0.5$ and $\bar X = 0.5$ lie inside it. For the latter two values, we therefore have no evidence against their being sampled from $N\!\Bigl(\mu=0,\,\frac{\sigma^2}{n} = 1\Bigr)$.
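
The same comparison can be made numerically rather than by eye. The sketch below computes, for each of the four values, the two-sided tail probability under $N(0,1)$ and whether the value falls beyond the 5% critical value. The names `candidates`, `critical`, and `results` are assumptions introduced for this illustration.

```python
from scipy import stats

candidates = [-3, -0.5, 0.5, 3]
critical = stats.norm.ppf(0.975)  # upper 2.5% quantile of N(0, 1), about 1.96

results = {}
for xbar in candidates:
    # Probability of a value at least this far from 0 under N(0, 1)
    p_two_sided = 2 * stats.norm.sf(abs(xbar))
    outside = abs(xbar) > critical
    results[xbar] = (p_two_sided, outside)
    print(f"x_bar={xbar:>4}: two-sided tail prob={p_two_sided:.4f}, "
          f"outside 95% interval: {outside}")
```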

This leads us to the notion of hypothesis testing.

8.3. One‑Sample z‑Test for a Mean¶

  • Null hypothesis: $H_0\colon \mu = \mu_0$.
  • Test statistic (known $\sigma$): $ z \;=\;\frac{\bar X - \mu_0}{\sigma / \sqrt{n}} \sim N(0,1)\;\text{under }H_0. $
  • Decision rule: Reject $H_0$ if $\lvert z\rvert > z_{\alpha/2}$.

$\alpha$ is the significance level (usually 5%, for which $z_{\alpha/2}\approx 1.96$).

Example case of z-Test:

Suppose a company claims that its light bulbs have an average lifespan of 1,000 hours. A quality control analyst selects a random sample of 50 bulbs and finds a sample mean lifespan of 980 hours. The population standard deviation is known to be 60 hours. We want to test whether this sample provides enough evidence to conclude that the actual mean lifespan differs from the company's claim.
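
A minimal sketch of how this example could be worked through, using only the summary figures given above (claimed mean 1,000, sample mean 980, $\sigma=60$, $n=50$) and assuming a two-sided test at $\alpha=5\%$. The variable names are introduced here for illustration.

```python
import numpy as np
from scipy import stats

claimed_mean = 1000.0  # mu_0 under H0
sample_mean = 980.0    # observed x-bar
sigma_pop = 60.0       # known population standard deviation
n_bulbs = 50           # sample size
alpha = 0.05           # assumed significance level

# z = (x_bar - mu_0) / (sigma / sqrt(n))
z_mean = (sample_mean - claimed_mean) / (sigma_pop / np.sqrt(n_bulbs))
p_mean = 2 * stats.norm.sf(abs(z_mean))       # two-sided p-value
z_crit = stats.norm.ppf(1 - alpha / 2)        # about 1.96
reject_mean = abs(z_mean) > z_crit

print(f"z = {z_mean:.3f}, p-value = {p_mean:.4f}, reject H0: {reject_mean}")
```

Under these assumptions $|z|\approx 2.36$ exceeds 1.96, so $H_0$ is rejected at the 5% level: the sample suggests that the mean lifespan differs from 1,000 hours.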

8.4. One‑Sample z‑Test for a Proportion¶

  • Statistic: $\hat p$ from $n$ Bernoulli trials.
  • Null hypothesis: $H_0\colon p = p_0$.
  • Test statistic: $ z \;=\;\frac{\hat p - p_0}{\sqrt{p_0(1-p_0)/n}} \approx N(0,1) $ for large $n$.

Example case:

Suppose a company claims that 80% of its customers are satisfied with their service. A recent survey of 100 customers found that 73% reported being satisfied. We want to test whether this sample provides enough evidence to conclude that the actual proportion of satisfied customers differs from the company's claim.
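
This example could be worked through in the same way, using the figures given above ($p_0=0.80$, $\hat p=0.73$, $n=100$) and assuming a two-sided test at $\alpha=5\%$. The variable names are introduced here for illustration.

```python
import numpy as np
from scipy import stats

p0 = 0.80          # claimed proportion under H0
p_hat = 0.73       # observed sample proportion
n_customers = 100  # sample size
alpha = 0.05       # assumed significance level

# z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
z_prop = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n_customers)
p_prop = 2 * stats.norm.sf(abs(z_prop))       # two-sided p-value
reject_prop = abs(z_prop) > stats.norm.ppf(1 - alpha / 2)

print(f"z = {z_prop:.3f}, p-value = {p_prop:.4f}, reject H0: {reject_prop}")
```

Here $|z| = 1.75$ is below 1.96, so under these assumptions $H_0$ is not rejected at the 5% level: the survey does not provide enough evidence that the satisfaction rate differs from 80%.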
